431 Class 26

Thomas E. Love, Ph.D.

2023-12-07

Today’s Agenda

  1. Graphical and Numerical Summaries of Data
  2. 432 preview: A tidymodels example with interaction
  3. Takeaways from 431

Today’s Packages

library(knitr); library(kableExtra)
library(janitor); library(mosaic)
library(patchwork); library(broom)
library(tidyverse)

theme_set(theme_bw())

and a couple of secrets, hidden for now.

Visualizing Data

New Data Set 1

response   n missing  mean    sd
y        142       0 47.83 26.94
x        142       0 54.27 16.77

13 Data Sets in the df tibble:

   set   n mean_x  sd_x mean_y  sd_y corr_xy
1    1 142  54.27 16.77  47.83 26.94   -0.06
2    2 142  54.27 16.77  47.83 26.94   -0.07
3    3 142  54.27 16.76  47.84 26.93   -0.07
4    4 142  54.26 16.77  47.83 26.94   -0.06
5    5 142  54.26 16.77  47.84 26.93   -0.06
6    6 142  54.26 16.77  47.83 26.94   -0.06
7    7 142  54.27 16.77  47.84 26.94   -0.07
8    8 142  54.27 16.77  47.84 26.94   -0.07
9    9 142  54.27 16.77  47.83 26.94   -0.07
10  10 142  54.27 16.77  47.84 26.93   -0.06
11  11 142  54.27 16.77  47.84 26.94   -0.07
12  12 142  54.27 16.77  47.83 26.94   -0.07
13  13 142  54.26 16.77  47.84 26.93   -0.07
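A per-set summary table like the one above can be built with a single dplyr group-and-summarise pass. A minimal sketch, working from the `datasaurus_dozen` tibble directly (there the set identifier is the `dataset` column):

```r
library(dplyr)
library(datasauRus)

# one row per set: n, means, SDs, and the x-y correlation
datasaurus_dozen |>
  group_by(dataset) |>
  summarise(n = n(),
            mean_x = mean(x), sd_x = sd(x),
            mean_y = mean(y), sd_y = sd(y),
            corr_xy = cor(x, y))
```

Every set comes out with essentially identical summaries, which is the point of the demonstration.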

New Data: Model for Set 1

set_1 <- lm(y ~ x, data = df |> filter(set == 1))

tidy(set_1, conf.int = TRUE, conf.level = 0.9) |> 
  select(-statistic, -p.value) |> kable(digits = 2)

term        estimate std.error conf.low conf.high
(Intercept)    53.43      7.69    40.69     66.16
x              -0.10      0.14    -0.33      0.12

glance(set_1) |>
  select(r.squared, adj.r.squared, sigma, BIC, p.value) |>
  kable(digits = 3)

r.squared adj.r.squared sigma     BIC p.value
    0.004        -0.003 26.98 1351.64   0.448

All 13 Models, at a glance

dataset r.squared adj.r.squared sigma  AIC  BIC
      1     0.004        -0.003    27 1343 1352
      2     0.005        -0.002    27 1343 1352
      3     0.005        -0.002    27 1343 1351
      4     0.004        -0.003    27 1343 1352
      5     0.004        -0.003    27 1343 1352
      6     0.004        -0.003    27 1343 1352
      7     0.005        -0.002    27 1343 1352
      8     0.005        -0.002    27 1343 1352
      9     0.005        -0.002    27 1343 1352
     10     0.004        -0.003    27 1343 1352
     11     0.005        -0.002    27 1343 1352
     12     0.004        -0.003    27 1343 1352
     13     0.004        -0.003    27 1343 1352
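A table of all 13 model summaries like this one can be assembled by fitting the same model within each group. A sketch, again working from `datasaurus_dozen` directly:

```r
library(dplyr)
library(broom)
library(datasauRus)

# fit lm(y ~ x) within each set, then collect one glance() row per set
datasaurus_dozen |>
  group_by(dataset) |>
  group_modify(~ glance(lm(y ~ x, data = .x))) |>
  select(dataset, r.squared, adj.r.squared, sigma, AIC, BIC)
```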

Plot for Data Set 1

Does a linear model for y using x seem appropriate?

https://xkcd.com/1725/

Set 1 Plot (+lm, +loess)

Model 1 (linear model for Set 1)


Call:
lm(formula = y ~ x, data = filter(df, set == 1))

Coefficients:
(Intercept)            x  
     53.425       -0.103  

Residual Plots for Set 1 Model

The Other 12 Data Sets

Models 2-13 all look about the same in terms of means, standard deviations, correlations, and fitted regression models. But what happens if we plot the data?

temp2 <- df |> filter(set != 1) 

ggplot(temp2, aes(x = x, y = y, color = dataset)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~ set, labeller = "label_both")

The Other 12 Data Sets

Actually, each set has a name

And a linear model yields the same fit for each

How about a loess smooth with default span?

And the data come from

These are, of course, the datasauRus dozen data sets described by Spiegelhalter. They are available in the datasauRus package, which you can install from CRAN, thanks to the work of Steph Locke.

library(datasauRus)
df <- datasaurus_dozen
  • These were created by Alberto Cairo, who has some great books like How Charts Lie

The moral of the story: never trust summary statistics alone; always visualize your data.

Two Cool Things, available online…

  1. We’ll visit Tomas Westlake’s work at https://r-mageddon.netlify.app/post/reanimating-the-datasaurus/
library(datasauRus)
library(ggplot2)
library(gganimate)

ggplot(datasaurus_dozen, aes(x=x, y=y))+
  geom_point()+
  theme_minimal() +
  transition_states(dataset, 3, 1)

Two Cool Things, available online…

  2. Next, we’ll visit https://www.autodesk.com/research/publications/same-stats-different-graphs

This is Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing by Justin Matejka and George Fitzmaurice.

We’ll look at a couple of the animated plots they generate there.

A Taste of 432: Sea Urchins and Tidy Modeling

Sea Urchins and Tidy Modeling

Constable (1993) compared the inter-radial suture widths of urchins maintained on one of three food regimes

  • Initial: no additional food supplied above what was in the initial sample
  • Low: food supplied periodically
  • High: food supplied ad libitum (as often as desired)

In an attempt to control for substantial variability in urchin sizes, the initial body volume of each urchin was measured as a covariate.

  • This example comes from https://www.tidymodels.org/start/models/
  • Another key source is https://www.flutterbys.com.au/stats/tut/tut7.5a.html
  • Data from Constable, A.J. The role of sutures in shrinking of the test in Heliocidaris erythrogramma (Echinoidea: Echinometridae). Marine Biology 117, 423-430 (1993). https://doi.org/10.1007/BF00349318

Package Load / Data ingest (Sea Urchins)

library(tidymodels)
library(readr)
library(broom.mixed)

urchins <-
  # Data were assembled for a tutorial 
  # at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
  read_csv("https://tidymodels.org/start/models/urchins.csv") |> 
  # Change the names to be a little more verbose
  setNames(c("food_regime", "initial_volume", "width")) |> 
  mutate(food_regime = 
           factor(food_regime, 
                  levels = c("Initial", "Low", "High")))

The urchins data

For each of 72 sea urchins, we know their

  • experimental feeding regime group (food_regime: either Initial, Low, or High),
  • initial volume in milliliters at the start of the experiment (initial_volume), and
  • suture width at the end of the experiment (width).
glimpse(urchins)
Rows: 72
Columns: 3
$ food_regime    <fct> Initial, Initial, Initial, Initial, Initial, Initial, I…
$ initial_volume <dbl> 3.5, 5.0, 8.0, 10.0, 13.0, 13.0, 15.0, 15.0, 16.0, 17.0…
$ width          <dbl> 0.010, 0.020, 0.061, 0.051, 0.041, 0.061, 0.041, 0.071,…

Plot the Data

How should we model the data?

Since the slopes appear to differ for at least two of the feeding regimes, let’s build a model that allows for a two-way interaction. We’ll use a linear model for width in which each food regime gets its own slope and intercept for the effect of initial volume.

lm(width ~ initial_volume * food_regime, data = urchins) |> tidy() |> 
  select(term, estimate) |> kable(dig = 4) |> kable_styling(font_size = 24)
term                           estimate
(Intercept)                      0.0331
initial_volume                   0.0016
food_regimeLow                   0.0198
food_regimeHigh                  0.0214
initial_volume:food_regimeLow   -0.0013
initial_volume:food_regimeHigh   0.0005

Setting up a linear regression with tidymodels

lm_mod <-
  linear_reg()  |> 
  set_engine("lm")

lm_mod
Linear Regression Model Specification (regression)

Computational engine: lm 

It turns out that we’ll have several options for engines here.
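parsnip can list the engines registered for a model specification; which of them you can actually use depends on what is installed. For example:

```r
library(parsnip)

# engines registered for linear_reg(), with the mode each supports
show_engines("linear_reg")
```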

We can estimate or train the model with fit()

lm_fit <- 
  lm_mod |>
  fit(width ~ initial_volume * food_regime, data = urchins)

We’ll look at the results on the next slide.

What’s in lm_fit?

tidy(lm_fit, conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |> 
  kable(dig = 4) |> kable_styling(font_size = 28)
term                           estimate conf.low conf.high
(Intercept)                      0.0331   0.0139    0.0523
initial_volume                   0.0016   0.0008    0.0023
food_regimeLow                   0.0198  -0.0061    0.0457
food_regimeHigh                  0.0214  -0.0076    0.0504
initial_volume:food_regimeLow   -0.0013  -0.0023   -0.0002
initial_volume:food_regimeHigh   0.0005  -0.0009    0.0019
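Because R uses treatment coding here, the initial_volume slope within each food regime is the main effect plus the relevant interaction term. A quick base-R check using the rounded estimates from the table above:

```r
# rounded estimates from the fitted model above
b_volume   <- 0.0016    # initial_volume (slope for the reference level, Initial)
b_int_low  <- -0.0013   # initial_volume:food_regimeLow
b_int_high <- 0.0005    # initial_volume:food_regimeHigh

# per-regime slopes for initial_volume
slope_initial <- b_volume
slope_low     <- b_volume + b_int_low
slope_high    <- b_volume + b_int_high

c(Initial = slope_initial, Low = slope_low, High = slope_high)
```

So the Low regime's fitted slope is nearly flat, while High has the steepest slope, matching what the plot suggests.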

Make Predictions

Suppose that, for a publication, it would be particularly interesting to make a plot of the mean body size for urchins that started the experiment with an initial volume of 20 ml. To create such a graph, we start with some new example data that we will make predictions for.

new_points <- expand.grid(initial_volume = 20, 
                          food_regime = c("Initial", "Low", "High"))

new_points
  initial_volume food_regime
1             20     Initial
2             20         Low
3             20        High

Obtain Predicted Results for these new_points

We’ll develop mean predictions and uncertainty intervals.

mean_pred <- predict(lm_fit, new_data = new_points)
conf_int_pred <- predict(lm_fit, 
                         new_data = new_points, 
                         type = "conf_int")
plot_data <- 
  new_points |> 
  bind_cols(mean_pred) |> 
  bind_cols(conf_int_pred)

Plot the plot_data results

ggplot(plot_data, aes(x = food_regime, y = .pred)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper),
                width = .2) + 
  labs(y = "urchin size",
       title = "Linear model fit using `lm`")

Plot the plot_data results

Could we use Bayesian methods?

Would the results be different if we used a Bayesian approach?

  • Need to select a prior.
  • Let’s use bell-shaped priors on the intercepts and slopes, using a Cauchy distribution (works out to be the same as a t distribution with one degree of freedom)
  • The stan_glm() function from rstanarm can do this, and it is available as the "stan" engine in tidymodels; we specify prior and prior_intercept when setting up the linear model.

Setting up a Bayesian Model

prior_dist <- rstanarm::student_t(df = 1)

set.seed(123)

bayes_mod <-   
  linear_reg() |> 
  set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist) 

Training the Bayes Model

bayes_fit <- bayes_mod |> 
  fit(width ~ initial_volume * food_regime, data = urchins)

tidy(bayes_fit, conf.int = TRUE) |> 
  select(term, estimate, conf.low, conf.high) |> 
  kable(dig = 4) |> kable_styling(font_size = 24)
term                           estimate conf.low conf.high
(Intercept)                      0.0330   0.0167    0.0492
initial_volume                   0.0016   0.0009    0.0023
food_regimeLow                   0.0198  -0.0020    0.0415
food_regimeHigh                  0.0216  -0.0027    0.0460
initial_volume:food_regimeLow   -0.0013  -0.0021   -0.0004
initial_volume:food_regimeHigh   0.0005  -0.0007    0.0017

Building the plot for the Bayes model

bayes_plot_data <- 
  new_points |> 
  bind_cols(predict(bayes_fit, new_data = new_points)) |> 
  bind_cols(predict(bayes_fit, new_data = new_points, 
                    type = "conf_int"))

ggplot(bayes_plot_data, aes(x = food_regime, y = .pred)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), 
                width = .2) + 
  labs(y = "urchin size",
       title = "Bayesian model with t(1) prior distribution")

Building the plot for the Bayes model

Comparing the Models

The models aren’t actually the same

What are we plotting, actually?

plot_data |> kable(dig = 4) |> kable_styling(font_size = 28)

initial_volume food_regime  .pred .pred_lower .pred_upper
            20     Initial 0.0642      0.0555      0.0729
            20         Low 0.0588      0.0499      0.0678
            20        High 0.0961      0.0870      0.1053

bayes_plot_data |> kable(dig = 4) |> kable_styling(font_size = 28)

initial_volume food_regime  .pred .pred_lower .pred_upper
            20     Initial 0.0642      0.0554      0.0729
            20         Low 0.0588      0.0498      0.0677
            20        High 0.0960      0.0869      0.1055

What do we take away from 431 at the end of the day?

Ten Simple Rules for Effective Statistical Practice

From PLoS Computational Biology

  1. Statistical Methods Should Enable Data to Answer Scientific Questions
  2. Signals Always Come with Noise
  3. Plan Ahead, Really Ahead
  4. Worry About Data Quality
  5. Statistical Analysis Is More Than a Set of Computations

Ten Simple Rules for Effective Statistical Practice

From PLoS Computational Biology

  6. Keep it Simple
  7. Provide Assessments of Variability
  8. Check Your Assumptions
  9. When Possible, Replicate!
  10. Make Your Analysis Reproducible

The Impact of Study Design (Gelman)

Applied statistics is hard.

  • Doing a statistical analysis is like playing basketball, or knitting a sweater. You can get better with practice.
  • Incompetent statistics does not necessarily doom a research paper: some findings are solid enough that they show up even when there are mistakes in the data collection and data analyses. But we’ve also seen many examples where incompetent statistics led to conclusions that made no sense but still received publication and publicity.
  • We should be thinking not just about data analysis, but also data quality.

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. (R. A. Fisher)

What does the sad p-value bear lead to?

So you collected data and analyzed the results. Now you want to do an after-the-fact (post hoc) power analysis.

  1. What will you use as your “true” effect size?
    • Too often, the point estimate from the data - yuck. The results are very misleading: power is generally seriously overestimated when computed on the basis of statistically significant results.
    • It is much better (but rarer) to identify plausible effect sizes based on external information rather than on your sparkling new result.
  2. What are you trying to do? Too often, it is to
    • get the researcher off the hook (“I didn’t get p < 0.05 because I had low power” - an alibi to explain away non-significant findings) or
    • encourage overconfidence in the finding.

None of this is particularly smart.
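To see the overestimation in numbers, here is a hypothetical base-R illustration with power.t.test(); the effect sizes and sample size below are made up for illustration only.

```r
# hypothetical numbers: a noisy, "significant" observed effect versus a
# smaller effect size justified from prior external evidence
observed_delta  <- 0.8   # point estimate from the new study (inflated by selection)
plausible_delta <- 0.3   # effect size plausible on external grounds

post_hoc    <- power.t.test(n = 40, delta = observed_delta,  sd = 1)$power
prospective <- power.t.test(n = 40, delta = plausible_delta, sd = 1)$power

# the same study looks well-powered under the observed effect
# and badly underpowered under the plausible one
round(c(post_hoc = post_hoc, prospective = prospective), 2)
```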

Build Tidy Data Sets

  • Each variable you measure should be in one column.
  • Each different observation of that variable should be in a different row.
  • There should be one table for each “kind” of variable.
  • If you have multiple tables, they should include a column in the table that allows them to be linked.
  • Include a row at the top of each data table that contains real variable names. Age_at_Diagnosis is a much better name than ADx.
  • Build useful codebooks.

Jeff Leek: “How to share data with a statistician”
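As a tiny made-up illustration of the first two rules, a wide table with one column per visit can be reshaped with tidyr so that each row holds one observation of one variable:

```r
library(tibble)
library(tidyr)

# untidy: one column per visit (made-up blood pressure data)
bp_wide <- tibble(id = 1:2,
                  sbp_visit1 = c(120, 135),
                  sbp_visit2 = c(118, 130))

# tidy: one row per (subject, visit) observation
bp_tidy <- bp_wide |>
  pivot_longer(cols = starts_with("sbp_"),
               names_to = "visit", names_prefix = "sbp_",
               values_to = "sbp")

bp_tidy
```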

A Tip from David Robinson

Ten of the Most Important 431 Ideas

  1. You have to visualize and count data to understand it.
  2. 90% of statistical work could be described as data management.
  3. R Markdown and the tidyverse make it easier to do the right thing.
  4. Statistical significance is not a helpful concept.
  5. Point estimates and confidence intervals are useful ideas.

Ten of the Most Important 431 Ideas

  6. Most statistical procedures are in fact regression models.
  7. All statistical methods involve assumptions worth checking.
  8. The bootstrap is a very useful and somewhat underused tool.
  9. Prediction models need to predict well in new situations.
  10. Statistical thinking is far too important to be left to statisticians.

Session Information

sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] broom.mixed_0.2.9.4 yardstick_1.2.0     workflowsets_1.0.1 
 [4] workflows_1.1.3     tune_1.1.2          rsample_1.2.0      
 [7] recipes_1.0.8       parsnip_1.1.1       modeldata_1.2.0    
[10] infer_1.0.5         dials_1.2.0         scales_1.3.0       
[13] tidymodels_1.1.1    lubridate_1.9.3     forcats_1.0.0      
[16] stringr_1.5.1       purrr_1.0.2         readr_2.1.4        
[19] tidyr_1.3.0         tibble_3.2.1        tidyverse_2.0.0    
[22] broom_1.0.5         patchwork_1.1.3     mosaic_1.9.0       
[25] mosaicData_0.20.4   ggformula_0.12.0    dplyr_1.1.4        
[28] Matrix_1.6-4        ggplot2_3.4.4       lattice_0.21-9     
[31] janitor_2.2.0       kableExtra_1.3.4    knitr_1.45         
[34] datasauRus_0.1.6   

loaded via a namespace (and not attached):
  [1] shinythemes_1.2.0    splines_4.3.2        later_1.3.1         
  [4] hardhat_1.3.0        xts_0.13.1           rpart_4.1.21        
  [7] lifecycle_1.0.4      StanHeaders_2.26.28  globals_0.16.2      
 [10] processx_3.8.2       vroom_1.6.4          MASS_7.3-60         
 [13] crosstalk_1.2.1      backports_1.4.1      magrittr_2.0.3      
 [16] rmarkdown_2.25       yaml_2.3.7           httpuv_1.6.12       
 [19] pkgbuild_1.4.2       minqa_1.2.6          abind_1.4-5         
 [22] rvest_1.0.3          tensorA_0.36.2       nnet_7.3-19         
 [25] ipred_0.9-14         labelled_2.12.0      lava_1.7.3          
 [28] inline_0.3.19        listenv_0.9.0        parallelly_1.36.0   
 [31] svglite_2.1.2        codetools_0.2-19     DT_0.30             
 [34] xml2_1.3.5           tidyselect_1.2.0     rstanarm_2.26.1     
 [37] bayesplot_1.10.0     farver_2.1.1         lme4_1.1-35.1       
 [40] matrixStats_1.1.0    stats4_4.3.2         base64enc_0.1-3     
 [43] webshot_0.5.5        jsonlite_1.8.7       ellipsis_0.3.2      
 [46] ggridges_0.5.4       survival_3.5-7       iterators_1.0.14    
 [49] systemfonts_1.0.5    foreach_1.5.2        tools_4.3.2         
 [52] Rcpp_1.0.11          glue_1.6.2           prodlim_2023.08.28  
 [55] gridExtra_2.3        xfun_0.41            mgcv_1.9-0          
 [58] distributional_0.3.2 loo_2.6.0            withr_2.5.2         
 [61] fastmap_1.1.1        boot_1.3-28.1        fansi_1.0.5         
 [64] shinyjs_2.1.0        callr_3.7.3          digest_0.6.33       
 [67] timechange_0.2.0     R6_2.5.1             mime_0.12           
 [70] colorspace_2.1-0     gtools_3.9.5         markdown_1.11       
 [73] threejs_0.3.3        utf8_1.2.4           generics_0.1.3      
 [76] data.table_1.14.8    class_7.3-22         prettyunits_1.2.0   
 [79] httr_1.4.7           htmlwidgets_1.6.3    pkgconfig_2.0.3     
 [82] dygraphs_1.1.1.6     gtable_0.3.4         timeDate_4022.108   
 [85] GPfit_1.0-8          furrr_0.3.1          htmltools_0.5.7     
 [88] posterior_1.5.0      gower_1.0.1          snakecase_0.11.1    
 [91] rstudioapi_0.15.0    tzdb_0.4.0           reshape2_1.4.4      
 [94] checkmate_2.3.0      nloptr_2.0.3         nlme_3.1-163        
 [97] curl_5.1.0           zoo_1.8-12           parallel_4.3.2      
[100] miniUI_0.1.1.1       pillar_1.9.0         grid_4.3.2          
[103] vctrs_0.6.5          shinystan_2.6.0      promises_1.2.1      
[106] xtable_1.8-4         lhs_1.1.6            evaluate_0.23       
[109] cli_3.6.1            compiler_4.3.2       rlang_1.1.2         
[112] crayon_1.5.2         rstantools_2.3.1.1   future.apply_1.11.0 
[115] labeling_0.4.3       ps_1.7.5             plyr_1.8.9          
[118] stringi_1.8.2        rstan_2.32.3         viridisLite_0.4.2   
[121] QuickJSR_1.0.8       munsell_0.5.0        colourpicker_1.3.0  
[124] mosaicCore_0.9.4.0   V8_4.4.0             hms_1.1.3           
[127] bit64_4.0.5          future_1.33.0        shiny_1.8.0         
[130] highr_0.10           haven_2.5.4          igraph_1.5.1        
[133] RcppParallel_5.1.7   bit_4.0.5            DiceDesign_1.9